Multi-Lingual Sentiment Analysis Documentation

1. Executive Summary

The Multi-Lingual Sentiment Analysis Pipeline project delivers an end-to-end NLP solution for sentiment detection (positive/negative/neutral) across 12+ languages, including informal text, emojis, slang, and code-mixed input. It fine-tunes XLM-RoBERTa on diverse multilingual datasets, adds dedicated preprocessing modules, compares the deep learning model against a rule-based baseline, and provides an interactive Streamlit dashboard for real-time and batch analysis. The system achieves 89% average accuracy/F1, outperforms the rule-based baseline by 25% on complex inputs, and reduces analysis time by 70%. The project ran for 8.5 months, from March to November 2025, and targets global social media and customer feedback applications.

2. Architecture Overview

The architecture follows a modular pipeline. Input text is preprocessed (emoji conversion, slang normalization, code-mix detection) and passed to the fine-tuned XLM-RoBERTa model for inference. It can optionally be routed to the rule-based baseline for comparison. Results are shown with confidence scores in the Streamlit dashboard. This design supports language-agnostic handling, real-time visualization (pie charts, distributions), batch processing, and straightforward deployment. It covers 12 languages and relies on transfer learning for the low-resource ones.
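The flow described above can be sketched as a small composition of stages. This is a minimal illustration, not the project's implementation: `run_pipeline`, `SentimentResult`, `toy_model`, and `toy_baseline` are hypothetical names, and the two toy classifiers stand in for the fine-tuned XLM-RoBERTa model and the rule-based baseline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A classifier maps text to (label, confidence). A preprocessor maps text to text.
Classifier = Callable[[str], Tuple[str, float]]
Preprocessor = Callable[[str], str]


@dataclass
class SentimentResult:
    text: str                             # original input
    cleaned: str                          # text after all preprocessing steps
    label: str                            # prediction from the primary model
    confidence: float                     # primary model score in [0, 1]
    baseline_label: Optional[str] = None  # set only when a baseline is supplied


def run_pipeline(
    text: str,
    steps: List[Preprocessor],
    model: Classifier,
    baseline: Optional[Classifier] = None,
) -> SentimentResult:
    """Apply each preprocessing step in order, then classify the cleaned text."""
    cleaned = text
    for step in steps:
        cleaned = step(cleaned)
    label, confidence = model(cleaned)
    # The baseline is optional and only used for side-by-side comparison.
    baseline_label = baseline(cleaned)[0] if baseline is not None else None
    return SentimentResult(text, cleaned, label, confidence, baseline_label)


# Hypothetical stand-ins for the real model and the rule-based baseline.
def toy_model(text: str) -> Tuple[str, float]:
    return ("positive", 0.9) if "good" in text else ("neutral", 0.5)


def toy_baseline(text: str) -> Tuple[str, float]:
    return ("positive", 1.0) if "good" in text else ("negative", 1.0)


result = run_pipeline("  This is GOOD  ", [str.strip, str.lower], toy_model, toy_baseline)
print(result.label, result.confidence, result.baseline_label)
```

Keeping each stage as a plain callable means a preprocessing module or the model can be swapped without touching the rest of the flow.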

3. Technology Stack

The system uses HuggingFace Transformers for XLM-RoBERTa fine-tuning and inference, Python for scripting and preprocessing (emoji, re, langdetect, NLTK), and Streamlit for the interactive web dashboard. Additional libraries include torch for training, datasets from the HuggingFace Hub, and VADER with adapted lexicons for the rule-based comparison. The system supports GPU acceleration and uses caching for efficiency.
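The caching mentioned above could look roughly like the following. This sketch uses the standard library's `functools.lru_cache` as a stand-in; the document does not say which caching mechanism the dashboard uses. `expensive_predict` is a hypothetical placeholder for a real model call.

```python
from functools import lru_cache
from typing import Tuple

# Counts how often the (hypothetical) model is actually invoked.
CALLS = {"count": 0}


def expensive_predict(text: str) -> Tuple[str, float]:
    """Placeholder for a slow model call; always returns a fixed prediction."""
    CALLS["count"] += 1
    return ("neutral", 0.5)


@lru_cache(maxsize=4096)
def cached_predict(text: str) -> Tuple[str, float]:
    """Return a cached prediction when the same text was seen before."""
    return expensive_predict(text)


first = cached_predict("hola")
second = cached_predict("hola")  # served from the cache, no second model call
print(CALLS["count"], cached_predict.cache_info().hits)
```

Caching keyed on the exact input string only helps with repeated inputs, which is common for batch uploads that contain duplicates.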

4. Sentiment Model and Features

The core model fine-tunes XLM-RoBERTa-base with a classification head (cross-entropy loss, learning rate 2e-5, 5 epochs) on multilingual datasets (ML-SENT, Twitter/reviews). Preprocessing features include emoji-to-text conversion (demojize), per-language slang dictionaries and regexes, and code-mix segmentation with langdetect followed by per-segment stopword removal. The rule-based baseline uses VADER and adapted lexicons. On code-mixed and nuanced text, the deep learning model reaches 89% F1 against 60-65% for the rule-based baseline. Outputs include confidence scores and visualizations.
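As a rough illustration of the emoji and slang features, the sketch below uses a hand-written emoji table in place of the `emoji` library's `demojize`, plus tiny per-language slang dictionaries. The table contents, the `EMOJI_NAMES` and `SLANG` names, and both functions are assumptions made for this example; the project's actual dictionaries are not shown in this document.

```python
import re
from typing import Dict

# Illustrative emoji-to-text table (a stand-in for emoji.demojize).
EMOJI_NAMES: Dict[str, str] = {
    "😂": ":face_with_tears_of_joy:",
    "👍": ":thumbs_up:",
}

# Illustrative per-language slang dictionaries.
SLANG: Dict[str, Dict[str, str]] = {
    "en": {"gr8": "great", "u": "you", "luv": "love"},
    "es": {"q": "que", "tb": "también"},
}


def convert_emojis(text: str) -> str:
    """Replace known emojis with text names and collapse extra whitespace."""
    for symbol, name in EMOJI_NAMES.items():
        text = text.replace(symbol, f" {name} ")
    return " ".join(text.split())


def normalize_slang(text: str, lang: str) -> str:
    """Replace whole-word slang terms for one language, ignoring case."""
    table = SLANG.get(lang, {})
    if not table:
        return text  # unknown language: leave the text unchanged
    # Longest keys first so longer terms win over shorter ones.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: table[m.group(0).lower()], text)


example = normalize_slang(convert_emojis("u r gr8 👍"), "en")
print(example)
```

Emoji conversion runs first here so that the slang pass sees plain text. The word-boundary match keeps the `u` inside `:thumbs_up:` from being rewritten.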

5. Data Processing

Data processing starts by curating multilingual corpora from HuggingFace Datasets and Twitter (12 languages). The text is then preprocessed with emoji handling, slang normalization (dictionaries/regex), code-mix detection (langdetect plus segmentation), and stopword removal, with augmentation for low-resource languages. Fine-tuning uses tokenized inputs with a maximum length of 512 tokens, and inference applies the same preprocessing pipeline. Caching keeps the dashboard responsive. The system handles 100+ queries/min and supports batch uploads.
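The segmentation step of code-mix handling can be sketched as follows. Note the swapped-in technique: this example splits text by Unicode script using the standard library's `unicodedata`, not by language with langdetect as the pipeline does. `script_of` and `segment_by_script` are hypothetical helper names.

```python
import unicodedata
from typing import List, Tuple


def script_of(token: str) -> str:
    """Return the script of the first letter in a token, or "OTHER" if none."""
    for ch in token:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "DEVANAGARI LETTER KA".
            return unicodedata.name(ch, "UNKNOWN").split(" ")[0]
    return "OTHER"


def segment_by_script(text: str) -> List[Tuple[str, str]]:
    """Group consecutive whitespace-separated tokens that share a script."""
    segments: List[Tuple[str, List[str]]] = []
    for token in text.split():
        script = script_of(token)
        if segments and (script == "OTHER" or segments[-1][0] == script):
            # Punctuation, digits, and same-script tokens extend the current segment.
            segments[-1][1].append(token)
        else:
            segments.append((script, [token]))
    return [(script, " ".join(tokens)) for script, tokens in segments]


mixed = "this movie बहुत अच्छी है !"
parts = segment_by_script(mixed)
for script, chunk in parts:
    print(script, "->", chunk)
```

Script-based splitting cannot separate languages that share a script, such as Spanish and English. That case needs a language detector applied per segment or per token.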

6. Project Timeline (8.5 Months)

  • 📅 Month 1: Planning & Data (Curate datasets, define languages).
  • 📅 Month 1.5-3: Preprocessing (Build emoji/slang/code-mix modules).
  • 📅 Month 3-5.5: Model Fine-Tuning (Train XLM-R, implement rule-based baseline).
  • 📅 Month 5.5-7: Dashboard (Develop Streamlit app with visuals).
  • 📅 Month 7-8: Evaluation & Comparison (Test metrics, generate report).
  • 📅 Month 8-8.5: Deployment & Handover (Host dashboard, provide training).

7. Testing & Deployment

Testing includes unit tests for the preprocessing modules, integration tests for the pipeline flow, accuracy/F1 evaluation on held-out multilingual benchmarks (target >85%), and usability testing of the dashboard in both batch and real-time modes. The Streamlit app is deployed on Streamlit Sharing or Heroku with caching and async processing. Rollout is phased, with GPU options, and model versioning allows rollback if issues arise.
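The benchmark check can be illustrated with a small, dependency-free scorer. The document does not say how F1 is averaged, so macro averaging over the three labels is an assumption here, as are the function names and the `passes_gate` helper that applies the >85% target.

```python
from typing import List, Sequence

LABELS = ("positive", "negative", "neutral")


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    labels: Sequence[str] = LABELS,
) -> float:
    """Unweighted mean of per-label F1 (macro averaging is an assumption)."""
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def passes_gate(score: float, threshold: float = 0.85) -> bool:
    """True when a metric is strictly above the release threshold."""
    return score > threshold


gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "negative", "positive", "positive"]
print(accuracy(gold, pred), macro_f1(gold, pred), passes_gate(macro_f1(gold, pred)))
```

Reporting the metrics per language as well as overall helps surface weak low-resource languages that an average would hide.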

8. Monitoring & Maintenance

After deployment, the team monitors inference latency and accuracy through Streamlit logs, tracks dashboard usage, and periodically re-fine-tunes the model on new data. Targets are >99% uptime and response times under 2 seconds. Maintenance includes quarterly updates to slang dictionaries and supported languages, monthly performance audits, and cost controls (caching, CPU fallback). Alerts are raised for low-confidence predictions.
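A minimal sketch of the latency and low-confidence checks might look like this. The 2-second limit comes from the response-time target above. The 0.6 confidence threshold, the window size, and the `InferenceMonitor` class are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List


class InferenceMonitor:
    """Track recent latencies and flag slow or low-confidence predictions."""

    def __init__(
        self,
        max_latency_s: float = 2.0,   # from the "<2s responses" target
        min_confidence: float = 0.6,  # hypothetical alert threshold
        window: int = 100,            # number of recent requests to keep
    ) -> None:
        self.max_latency_s = max_latency_s
        self.min_confidence = min_confidence
        self.latencies: Deque[float] = deque(maxlen=window)

    def record(self, latency_s: float, confidence: float) -> List[str]:
        """Store one observation and return any alerts it triggers."""
        self.latencies.append(latency_s)
        alerts: List[str] = []
        if latency_s > self.max_latency_s:
            alerts.append("slow_response")
        if confidence < self.min_confidence:
            alerts.append("low_confidence")
        return alerts

    def mean_latency(self) -> float:
        """Average latency over the current window, or 0.0 when empty."""
        if not self.latencies:
            return 0.0
        return sum(self.latencies) / len(self.latencies)


monitor = InferenceMonitor()
ok_alerts = monitor.record(latency_s=0.5, confidence=0.9)
bad_alerts = monitor.record(latency_s=3.0, confidence=0.4)
print(ok_alerts, bad_alerts, monitor.mean_latency())
```

The bounded window keeps memory use constant and makes the mean reflect recent traffic, not the whole deployment history.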

9. Roles & Responsibilities

  • 📂 Data Engineers: Curate datasets and build preprocessing pipelines.
  • 🧠 NLP Engineers: Fine-tune XLM-RoBERTa and implement the rule-based baseline.
  • 💻 Full-Stack Developers: Build the Streamlit dashboard interface.
  • 🚀 DevOps: Handle deployment, GPU acceleration, and caching.
  • 💼 Project Manager: Oversees Agile sprints and client refinements.